Abstract. This work concerns the estimation of multidimensional nonlinear
regression models using multilayer perceptrons (MLPs). The main problem
with such models is that the covariance matrix of the noise must be known
to obtain an optimal estimator. However, we show that choosing the logarithm
of the determinant of the empirical error covariance matrix as the cost
function yields an asymptotically optimal estimator.

1 Introduction 

Let us consider a sequence $(Y_t, Z_t)_{t \in \mathbb{N}}$ of i.i.d.¹ (i.e. independent and
identically distributed) random vectors, so that each couple $(Y_t, Z_t)$ has the
same law as a generic variable $(Y, Z)$. Moreover, we assume that the model can be written

$$Y_t = F_{W^0}(Z_t) + \varepsilon_t$$

where

• $F_{W^0}$ is a function represented by an MLP with parameters or weights $W^0$.

• $(\varepsilon_t)$ is an i.i.d. centered noise with unknown invertible covariance matrix $\Gamma_0$.
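To make the setting concrete, here is a minimal sketch that simulates data from a model of this form. The one-hidden-layer architecture, the dimensions, the Gaussian inputs and noise, and the particular noise covariance `gamma0` are all assumptions chosen for illustration; none of them comes from the paper.

```python
import numpy as np

def mlp(params, Z):
    """One-hidden-layer perceptron: F_W(z) = W2 tanh(W1 z + b1) + b2."""
    W1, b1, W2, b2 = params
    return np.tanh(Z @ W1.T + b1) @ W2.T + b2

rng = np.random.default_rng(0)
n, p, h, d = 1000, 3, 4, 2   # sample size, input, hidden and output dimensions (assumed)

# "True" weights W^0, drawn at random for the illustration.
true_params = (rng.normal(size=(h, p)), rng.normal(size=h),
               rng.normal(size=(d, h)), rng.normal(size=d))

# Hypothetical noise covariance: correlated components with unequal variances,
# i.e. the case where the mean square error is not the natural cost.
gamma0 = np.array([[1.0, 0.8],
                   [0.8, 2.0]])

Z = rng.normal(size=(n, p))                                  # inputs Z_t
eps = rng.multivariate_normal(np.zeros(d), gamma0, size=n)   # centered i.i.d. noise
Y = mlp(true_params, Z) + eps                                # outputs Y_t
```

Each row of `Y` is one observation of the output vector, and the rows of `eps` play the role of the centered noise with unknown covariance.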

Our goal is to estimate the true parameter by minimizing an appropriate cost
function. This model is called a regression model, and a popular choice for the
associated cost function is the mean square error:

$$\frac{1}{n} \sum_{t=1}^{n} \left\| Y_t - F_W(Z_t) \right\|^2$$

where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^d$. Although this function is widely
used, it is easy to show that it yields a suboptimal estimator. Another
solution is to use an approximation of the error covariance matrix to compute
the generalized least squares estimator:

$$\frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))^T \, \Gamma^{-1} \, (Y_t - F_W(Z_t)),$$

where $^T$ denotes matrix transposition. Here we assume that $\Gamma$ is a good
approximation of the true covariance matrix of the noise, $\Gamma_0$. However, computing
a good approximation of the matrix $\Gamma_0$ takes time, and it leads asymptotically
to the cost function proposed in this article (see for example Rynkiewicz [4]):

$$U_n(W) := \log\det\left( \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))(Y_t - F_W(Z_t))^T \right) \tag{1}$$

¹ It is not hard to extend everything shown in this paper to stationary mixing variables, and
hence to time series.

This paper is devoted to the theoretical study of $U_n(W)$. We assume that
the true architecture of the MLP is known, so that the Hessian matrix computed
in the sequel satisfies the assumption of being positive definite (see Fukumizu [1]).
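The three cost functions discussed above differ only in how they aggregate the residuals $Y_t - F_W(Z_t)$. The following sketch, which assumes the residuals are stored as the rows of an array `E` (a convention chosen here, not taken from the paper), computes the mean square error, the generalized least squares cost for a given covariance matrix, and the log-determinant cost (1).

```python
import numpy as np

def mse_cost(E):
    """Mean square error: (1/n) sum_t ||e_t||^2, with e_t the rows of E."""
    return np.mean(np.sum(E ** 2, axis=1))

def gls_cost(E, gamma):
    """Generalized least squares cost: (1/n) sum_t e_t^T gamma^{-1} e_t."""
    gamma_inv = np.linalg.inv(gamma)
    return np.mean(np.einsum("ti,ij,tj->t", E, gamma_inv, E))

def logdet_cost(E):
    """Cost (1): log det of the empirical residual covariance (1/n) sum_t e_t e_t^T."""
    gamma_n = E.T @ E / len(E)
    sign, logdet = np.linalg.slogdet(gamma_n)   # stable alternative to log(det(.))
    if sign <= 0:
        raise ValueError("empirical covariance is singular; more residuals are needed")
    return logdet

# Tiny hand-checkable example: four unit residuals along the coordinate axes,
# so the empirical covariance is 0.5 * identity.
E = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
costs = (mse_cost(E), gls_cost(E, np.eye(2)), logdet_cost(E))
```

Only `gls_cost` needs a covariance matrix as input; `logdet_cost` builds its own from the residuals, which is the practical appeal of (1).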

In this framework, we study the asymptotic behavior of $\hat{W}_n := \arg\min_W U_n(W)$,
the weights minimizing the cost function $U_n(W)$. We show that, under simple
assumptions, this estimator is asymptotically optimal in the sense that it has the
same asymptotic behavior as the generalized least squares estimator using the
true covariance matrix of the noise.

Numerical procedures to compute this estimator and examples of its behavior
can be found in Rynkiewicz [?].

2 The first and second derivatives of $W \mapsto U_n(W)$

First, we introduce some notation: if $F_W(X)$ is a $d$-dimensional parametric function
depending on a parameter $W$, let us write $\frac{\partial F_W(X)}{\partial W_k}$ (resp. $\frac{\partial^2 F_W(X)}{\partial W_k \partial W_l}$) for the
$d$-dimensional vector of partial derivatives (resp. second order partial derivatives)
of each component of $F_W(X)$.

2.1 First derivatives 

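Now, if $\Gamma_n(W)$ is a matrix depending on the parameter vector $W$, we get from
Magnus and Neudecker [3]

$$\frac{\partial}{\partial W_k} \ln\det(\Gamma_n(W)) = \mathrm{tr}\left( \Gamma_n^{-1}(W) \frac{\partial}{\partial W_k} \Gamma_n(W) \right)$$

Here

$$\Gamma_n(W) = \frac{1}{n} \sum_{t=1}^{n} (y_t - F_W(z_t))(y_t - F_W(z_t))^T$$

Note that the matrix $\Gamma_n(W)$ and its inverse are symmetric. Now, if we write

$$A_n(W_k) := \frac{1}{n} \sum_{t=1}^{n} \left( -\frac{\partial F_W(z_t)}{\partial W_k} (y_t - F_W(z_t))^T \right)$$

so that $\frac{\partial}{\partial W_k} \Gamma_n(W) = A_n(W_k) + A_n^T(W_k)$, then using the fact that

$$\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right) = \mathrm{tr}\left( A_n^T(W_k) \Gamma_n^{-1}(W) \right) = \mathrm{tr}\left( \Gamma_n^{-1}(W) A_n^T(W_k) \right)$$

we get

$$\frac{\partial}{\partial W_k} \ln\det(\Gamma_n(W)) = 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right) \tag{2}$$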
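Formula (2) can be checked numerically against finite differences. The sketch below does this under two assumptions that are not from the paper: $A_n(W_k)$ is taken to be the average over $t$ of $-\frac{\partial F_W(z_t)}{\partial W_k}(y_t - F_W(z_t))^T$, and the MLP is replaced by a toy single-layer model $F_W(z) = \tanh(Wz)$ whose derivatives are easy to write by hand.

```python
import numpy as np

def model(W, Z):
    """Toy stand-in for an MLP: F_W(z) = tanh(W z), with W of shape (d, p)."""
    return np.tanh(Z @ W.T)                      # shape (n, d)

def logdet_cost(W, Y, Z):
    """U_n(W) = log det((1/n) sum_t e_t e_t^T), with e_t = y_t - F_W(z_t)."""
    E = Y - model(W, Z)
    sign, logdet = np.linalg.slogdet(E.T @ E / len(E))
    return logdet

def analytic_gradient(W, Y, Z):
    """Gradient given by (2): dU_n/dW_k = 2 tr(Gamma_n^{-1} A_n(W_k))."""
    n = len(Y)
    F = model(W, Z)
    E = Y - F
    gamma_inv = np.linalg.inv(E.T @ E / n)
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            # dF/dW_ij is nonzero only in output component i.
            dF = np.zeros_like(F)
            dF[:, i] = (1.0 - F[:, i] ** 2) * Z[:, j]
            A = -(dF.T @ E) / n                  # (1/n) sum_t (-dF_t) e_t^T
            grad[i, j] = 2.0 * np.trace(gamma_inv @ A)
    return grad

def numeric_gradient(W, Y, Z, h=1e-6):
    """Central finite differences of U_n, one weight at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += h
        Wm[idx] -= h
        grad[idx] = (logdet_cost(Wp, Y, Z) - logdet_cost(Wm, Y, Z)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
n, d, p = 200, 2, 3
Z = rng.normal(size=(n, p))
W_true = rng.normal(size=(d, p))
noise = 0.1 * rng.multivariate_normal(np.zeros(d), [[1.0, 0.8], [0.8, 2.0]], size=n)
Y = model(W_true, Z) + noise
W = W_true + 0.1 * rng.normal(size=(d, p))      # evaluate away from the optimum

max_gap = np.max(np.abs(analytic_gradient(W, Y, Z) - numeric_gradient(W, Y, Z)))
```

If the formula is implemented correctly, `max_gap` should be at the level of finite-difference error, far below the size of the gradient entries themselves.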



2.2 Second derivatives 

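We write now

$$B_n(W_k, W_l) := \frac{1}{n} \sum_{t=1}^{n} \frac{\partial F_W(z_t)}{\partial W_k} \frac{\partial F_W(z_t)}{\partial W_l}^T$$

and

$$C_n(W_k, W_l) := \frac{1}{n} \sum_{t=1}^{n} \left( -\frac{\partial^2 F_W(z_t)}{\partial W_k \partial W_l} (y_t - F_W(z_t))^T \right)$$

so that $\frac{\partial}{\partial W_l} A_n(W_k) = B_n(W_k, W_l) + C_n(W_k, W_l)$. We get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = \frac{\partial}{\partial W_l} 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right) = 2\,\mathrm{tr}\left( \frac{\partial \Gamma_n^{-1}(W)}{\partial W_l} A_n(W_k) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) B_n(W_k, W_l) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) C_n(W_k, W_l) \right)$$

Now, Magnus and Neudecker [3] give an analytic form of the derivative of an
inverse matrix, $\frac{\partial \Gamma_n^{-1}(W)}{\partial W_l} = -\Gamma_n^{-1}(W) \frac{\partial \Gamma_n(W)}{\partial W_l} \Gamma_n^{-1}(W)$, and
$\frac{\partial \Gamma_n(W)}{\partial W_l} = A_n(W_l) + A_n^T(W_l)$, so we get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = -2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) \left( A_n(W_l) + A_n^T(W_l) \right) \Gamma_n^{-1}(W) A_n(W_k) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) B_n(W_k, W_l) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) C_n(W_k, W_l) \right)$$

so

$$\begin{aligned}
\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = {} & -2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_l) \Gamma_n^{-1}(W) A_n(W_k) \right) - 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n^T(W_l) \Gamma_n^{-1}(W) A_n(W_k) \right) \\
& + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) B_n(W_k, W_l) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) C_n(W_k, W_l) \right)
\end{aligned} \tag{3}$$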

3 Asymptotic properties of $\hat{W}_n$

First, following the same lines as Yao [5], it is easy to show that, if the noise of
the model has a moment of order at least 2, the estimator is strongly consistent
(i.e. $\hat{W}_n \xrightarrow{a.s.} W^0$).

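Moreover, for an MLP function, there exists a constant $C$ such that we have
the following inequalities:

$$\left\| \frac{\partial F_W(Z)}{\partial W_k} \right\| \le C (1 + \|Z\|)$$

$$\left\| \frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} \right\| \le C (1 + \|Z\|^2)$$

$$\left\| \frac{\partial^2 F_{W^1}(Z)}{\partial W_k \partial W_l} - \frac{\partial^2 F_{W^2}(Z)}{\partial W_k \partial W_l} \right\| \le C \, \|W^1 - W^2\| \, (1 + \|Z\|^3)$$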

So, if Z has a moment of order at least 3 (see the justification in Yao [5]), we 
get the following lemma : 

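Lemma 1 Let $\Delta U_n(W^0)$ be the gradient vector of $U_n(W)$ at $W^0$, $\Delta U(W^0)$ be
the gradient vector of $U(W) := \log\det(Y - F_W(Z))$ at $W^0$ and $HU_n(W^0)$ be
the Hessian matrix of $U_n(W)$ at $W^0$.

We define finally

$$B(W_k, W_l) := \frac{\partial F_W(Z)}{\partial W_k} \frac{\partial F_W(Z)}{\partial W_l}^T$$

and

$$A(W_l) := -\frac{\partial F_W(Z)}{\partial W_l} (Y - F_W(Z))^T$$

We get then

1. $HU_n(W^0) \xrightarrow{a.s.} 2 I_0$

2. $\sqrt{n}\, \Delta U_n(W^0) \xrightarrow{\text{Law}} \mathcal{N}(0, 4 I_0)$

where the component $(k, l)$ of the matrix $I_0$ is:

$$\mathrm{tr}\left( \Gamma_0^{-1} E\left( B(W_k^0, W_l^0) \right) \right)$$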

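proof To prove the lemma, we remark first that the component $(k, l)$ of the
matrix $4 I_0$ is:

$$E\left( \frac{\partial U(W^0)}{\partial W_k} \frac{\partial U(W^0)}{\partial W_l} \right) = E\left( 2\,\mathrm{tr}\left( \Gamma_0^{-1} A^T(W_k^0) \right) \times 2\,\mathrm{tr}\left( \Gamma_0^{-1} A(W_l^0) \right) \right)$$

and, since the trace of a product is invariant under circular permutation,

$$\begin{aligned}
E\left( \frac{\partial U(W^0)}{\partial W_k} \frac{\partial U(W^0)}{\partial W_l} \right)
&= 4 E\left( -\frac{\partial F_{W^0}(Z)}{\partial W_k}^T \Gamma_0^{-1} (Y - F_{W^0}(Z))(Y - F_{W^0}(Z))^T \Gamma_0^{-1} \left( -\frac{\partial F_{W^0}(Z)}{\partial W_l} \right) \right) \\
&= 4 E\left( \frac{\partial F_{W^0}(Z)}{\partial W_k}^T \Gamma_0^{-1} \frac{\partial F_{W^0}(Z)}{\partial W_l} \right) \\
&= 4\,\mathrm{tr}\left( \Gamma_0^{-1} E\left( \frac{\partial F_{W^0}(Z)}{\partial W_k} \frac{\partial F_{W^0}(Z)}{\partial W_l}^T \right) \right) \\
&= 4\,\mathrm{tr}\left( \Gamma_0^{-1} E\left( B(W_k^0, W_l^0) \right) \right)
\end{aligned}$$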

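Now, for the component $(k, l)$ of the expectation of the Hessian matrix, we
remark that the terms of (3) that are quadratic in $A_n$ vanish at $W^0$:

$$\lim_{n \to \infty} \mathrm{tr}\left( \Gamma_n^{-1}(W^0) A_n(W_l^0) \Gamma_n^{-1}(W^0) A_n(W_k^0) \right) = 0$$

(and likewise with $A_n(W_l^0)$ replaced by its transpose), and

$$\lim_{n \to \infty} \mathrm{tr}\left( \Gamma_n^{-1}(W^0) C_n(W_k^0, W_l^0) \right) = 0$$

so

$$\lim_{n \to \infty} \left( HU_n(W^0) \right)_{k,l} = \lim_{n \to \infty} 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W^0) B_n(W_k^0, W_l^0) \right) = 2\,\mathrm{tr}\left( \Gamma_0^{-1} E\left( B(W_k^0, W_l^0) \right) \right)$$
■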

From a classical argument of local asymptotic normality (see for example
Yao [5]), we then deduce the following property for the estimator $\hat{W}_n$:

Proposition 1 Let $W^*$ be the generalized least squares estimator:

$$W^* := \arg\min_W \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))^T \, \Gamma_0^{-1} \, (Y_t - F_W(Z_t))$$

then we have

$$\lim_{n \to \infty} \sqrt{n}\,(W^* - W^0) = \lim_{n \to \infty} \sqrt{n}\,(\hat{W}_n - W^0) = \mathcal{N}(0, I_0^{-1})$$

where the limits are taken in law.

We remark that $\hat{W}_n$ has the same asymptotic behavior as the generalized
least squares estimator using the inverse $\Gamma_0^{-1}$ of the true covariance matrix, which is
asymptotically optimal (see for example Ljung [5]), so the proposed estimator is
asymptotically optimal too.
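As an illustration of how the limiting covariance $I_0^{-1}$ could be used in practice, the sketch below forms a sample version of $I_0$ from model Jacobians and turns it into approximate standard errors. This plug-in use is an assumption made here for illustration, not a procedure described in the paper; the toy model $F_W(z) = \tanh(Wz)$, the weights `W_hat` and the matrix `gamma_hat` are hypothetical stand-ins for a fitted MLP and its empirical residual covariance.

```python
import numpy as np

def information_matrix(jacobians, gamma):
    """Sample version of I_0: entry (k, l) is the mean over t of
    (dF/dW_k)^T gamma^{-1} (dF/dW_l), which equals tr(gamma^{-1} B(W_k, W_l)).

    jacobians: array (n, d, m) with jacobians[t, :, k] = dF_W(z_t)/dW_k
    gamma:     array (d, d), noise covariance or an estimate of it
    """
    gamma_inv = np.linalg.inv(gamma)
    return np.einsum("tik,ij,tjl->kl", jacobians, gamma_inv, jacobians) / len(jacobians)

def tanh_layer_jacobian(W, Z):
    """Jacobian of the toy model F_W(z) = tanh(W z) w.r.t. the flattened weights."""
    d, p = W.shape
    F = np.tanh(Z @ W.T)                          # shape (n, d)
    J = np.zeros((len(Z), d, d * p))
    for i in range(d):
        # Output i depends only on row i of W: dF_i/dW_ij = (1 - F_i^2) z_j.
        J[:, i, i * p:(i + 1) * p] = (1.0 - F[:, i:i + 1] ** 2) * Z
    return J

rng = np.random.default_rng(1)
n, d, p = 500, 2, 3
Z = rng.normal(size=(n, p))
W_hat = rng.normal(size=(d, p))                   # hypothetical fitted weights
gamma_hat = np.array([[1.0, 0.8], [0.8, 2.0]])    # hypothetical residual covariance

I_hat = information_matrix(tanh_layer_jacobian(W_hat, Z), gamma_hat)
# Proposition 1 suggests cov(W_hat) is approximately I_0^{-1} / n for large n.
std_err = np.sqrt(np.diag(np.linalg.inv(I_hat)) / n)
```

The entries of `std_err` are ordered like the flattened weight matrix, row by row.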

4 Conclusion 

In the linear multidimensional regression model the optimal estimator has an
analytic solution (see Magnus and Neudecker [3]), so it does not make sense to
consider the minimization of a cost function. However, for the nonlinear
multidimensional regression model, the ordinary least squares estimator is suboptimal if
the covariance matrix of the noise is not the identity matrix. We can overcome
this difficulty by using the cost function $U_n(W)$. The numerical computation
and the empirical properties of this estimator have been studied in a previous
article (see Rynkiewicz [4]). In this paper, we have given a proof of the
optimality of the estimator associated with $U_n(W)$. It is therefore a good choice for
estimating multidimensional nonlinear regression models with multilayer
perceptrons.